1. Introduction

The dataset provided gathers information from a telemarketing campaign run by a banking institution. Throughout the campaign, clients were contacted multiple times about a term deposit subscription offer.

The goal of this project is to predict whether a client will subscribe to the term deposit offer. To this end, a classification analysis was conducted. The target variable is labelled ‘y’ and has the possible outcomes ‘yes’ or ‘no’.

2. Data Loading and Preprocessing

The first part of this report focuses on understanding the data. Here, data quality is evaluated and some basic cleaning and pre-processing are undertaken to prepare the data for the baseline model, which will act as a benchmark for all later data transformations in the feature engineering process. Within this pipeline, pre-processing covers anything needed to get the data ready for a baseline model that does not require an EDA (Exploratory Data Analysis): loading the data, merging it, checking for NAs and missing values, dropping meaningless variables and correcting variable types.

First, we load the necessary train and test data.

train = fread("BankCamp_train.csv", stringsAsFactors = T)
test = fread("BankCamp_test.csv", stringsAsFactors = T)

Here we reorder the factor levels so that evaluation metrics treat “yes” as the positive class.

train$y <- factor(train$y, levels = c("yes", "no"))
test$y <- factor(test$y, levels = c("yes", "no"))

Finally, we convert integer columns to numeric:

train[, which(sapply(train, is.integer)) := lapply(.SD, as.numeric), .SDcols = sapply(train, is.integer)]
test[, which(sapply(test, is.integer)) := lapply(.SD, as.numeric), .SDcols = sapply(test, is.integer)]

2.1 Initial Exploration


We explore the first few rows, structure and summary of the training data to see if there are any inconsistencies.

We see there are neither missing values nor duplicates in the set. However, we see that the “pdays” variable contains ‘-1’ values, which we assume indicate that the information is not applicable to the client.
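A quick check of how widespread these ‘-1’ placeholders are (a sketch, using the same data.table syntax as the rest of this report):

```r
# Count and share of rows where pdays is the -1 placeholder
train[pdays == -1, .N]
train[, mean(pdays == -1)]
```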

First Rows

head(train)
   age          job marital education default balance housing loan
1:  50 entrepreneur married   primary     yes     537     yes   no
2:  47   technician married secondary      no    -938     yes   no
3:  56    housemaid married   primary      no     605      no   no
4:  36  blue-collar married   primary      no    4608     yes   no
5:  41   management married   primary      no     362     yes   no
6:  32      student  single  tertiary      no       0      no   no
    contact day month duration campaign pdays previous poutcome   y
1:  unknown  20   jun       11       15    -1        0  unknown  no
2:  unknown  28   may      176        2    -1        0  unknown  no
3: cellular  19   aug      207        6    -1        0  unknown  no
4: cellular  14   may      284        7    -1        0  unknown  no
5: cellular  12   may      217        3    -1        0  unknown  no
6: cellular   4   feb      233        3   276        2  failure yes

Structure

str(train)
Classes 'data.table' and 'data.frame':  36168 obs. of  17 variables:
 $ age      : num  50 47 56 36 41 32 26 60 39 55 ...
 $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 3 10 4 2 5 9 9 2 8 1 ...
 $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 3 2 1 1 ...
 $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 1 1 1 3 2 1 2 2 ...
 $ default  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
 $ balance  : num  537 -938 605 4608 362 ...
 $ housing  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 2 2 2 ...
 $ loan     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
 $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 1 1 1 1 1 2 1 3 ...
 $ day      : num  20 28 19 14 12 4 29 12 16 3 ...
 $ month    : Factor w/ 12 levels "apr","aug","dec",..: 7 9 2 9 9 4 5 9 1 7 ...
 $ duration : num  11 176 207 284 217 233 297 89 539 131 ...
 $ campaign : num  15 2 6 7 3 3 1 2 1 1 ...
 $ pdays    : num  -1 -1 -1 -1 -1 276 -1 -1 -1 -1 ...
 $ previous : num  0 0 0 0 0 2 0 0 0 0 ...
 $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 1 4 4 4 4 ...
 $ y        : Factor w/ 2 levels "yes","no": 2 2 2 2 2 1 2 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Summary

summary(train)
      age                 job           marital          education
 Min.   :18.00   blue-collar:7848   divorced: 4148   primary  : 5517
 1st Qu.:33.00   management :7512   married :21838   secondary:18557
 Median :39.00   technician :6060   single  :10182   tertiary :10598
 Mean   :40.96   admin.     :4181                    unknown  : 1496
 3rd Qu.:48.00   services   :3286
 Max.   :95.00   retired    :1826
                 (Other)    :5455
 default        balance      housing      loan            contact
 no :35509   Min.   :-8019   no :16034   no :30354   cellular :23408
 yes:  659   1st Qu.:   72   yes:20134   yes: 5814   telephone: 2334
             Median :  448                           unknown  :10426
             Mean   : 1360
             3rd Qu.: 1428
             Max.   :98417

      day            month          duration         campaign
 Min.   : 1.00   may    :11064   Min.   :   0.0   Min.   : 1.000
 1st Qu.: 8.00   jul    : 5551   1st Qu.: 103.0   1st Qu.: 1.000
 Median :16.00   aug    : 4945   Median : 180.0   Median : 2.000
 Mean   :15.81   jun    : 4292   Mean   : 258.8   Mean   : 2.764
 3rd Qu.:21.00   nov    : 3175   3rd Qu.: 320.2   3rd Qu.: 3.000
 Max.   :31.00   apr    : 2335   Max.   :4918.0   Max.   :63.000
                 (Other): 4806
     pdays           previous         poutcome       y
 Min.   : -1.00   Min.   : 0.000   failure: 3942   yes: 4195
 1st Qu.: -1.00   1st Qu.: 0.000   other  : 1444   no :31973
 Median : -1.00   Median : 0.000   success: 1198
 Mean   : 40.08   Mean   : 0.574   unknown:29584
 3rd Qu.: -1.00   3rd Qu.: 0.000
 Max.   :854.00   Max.   :58.000
                                                              

Missing

sapply(train, function(x) sum(is.na(x)))
      age       job   marital education   default   balance   housing
        0         0         0         0         0         0         0
     loan   contact       day     month  duration  campaign     pdays
        0         0         0         0         0         0         0
 previous  poutcome         y
        0         0         0 

Duplicates

any(duplicated(train))
[1] FALSE

3. Exploratory Data Analysis (EDA)

In the first step, we explore the distribution of the variables in order to gain insights into the data and inform decisions in the later feature engineering and modelling process.

3.1 Target Variable Distribution

sort(summary(train$y), dec=T)/nrow(train)
       no       yes
0.8840135 0.1159865 
p0<-ggplot(train, aes(x=y))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4", label = TRUE)+
  theme(axis.text.x = element_text(angle=0))+ labs(title = "Distribution of Clients Subscribed")+
  xlab("y")+ ylab("Count")
p0


We see that the target variable is heavily unbalanced, with roughly 88% of the values denoting ‘no’. In the business context this is explained by the fact that the term deposit subscription is not targeted towards the majority of clients.

For later modelling and model evaluation this poses a challenge, as a model can achieve high accuracy purely by training on and predicting the ‘no’ class, despite having a very low sensitivity score (i.e., being poor at correctly identifying true positive targets).
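To make the accuracy trap concrete, a trivial classifier that always predicts ‘no’ would score the majority-class proportion (a sketch):

```r
# Accuracy of always predicting the majority class "no":
# equals the majority-class proportion, about 0.884 here (see section 3.1)
mean(train$y == "no")
```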

3.2 Numerical Variable Distributions

3.21 Numeric Correlation Plots

Next, a correlation plot is constructed to see if there is any significant correlation between numeric explanatory variables.

numerical <- select_if(train,is.numeric)
corr_plot <- ggcorr(numerical,  label_round=2, label = TRUE)
corr_plot

As we can see, no variable pairs are strongly correlated, apart from “previous” and “pdays”. We therefore do not need to spend much time transforming or dropping variables based purely on pairwise correlation between numeric explanatory variables.
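If we wanted to automate this check, caret’s findLinearCombos companion findCorrelation flags columns whose pairwise correlation exceeds a cutoff (a sketch; the 0.75 cutoff is an illustrative choice, not taken from this report):

```r
# Names of columns recommended for removal at |r| > 0.75; with this data
# only the "previous"/"pdays" pair comes close to the cutoff
findCorrelation(cor(numerical), cutoff = 0.75, names = TRUE)
```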

Here is an interactive version of the correlation matrix.

plot_ly(x = colnames(numerical),
        y = colnames(numerical),
        z = cor(numerical), colors = colorRamp(c("dodgerblue4", "brown2")),
        type = "heatmap") %>% layout(title="Interactive Correlation Matrix")

3.22 Numerical Variables Plot

num_y <- train[, c('age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous','y')]
ggpairs(num_y[,1:8], title = "Distribution & Correlation graph", ggplot2::aes(colour=y, alpha=1/100))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In the above graph we see how the numeric variables relate to each other and to the target variable. Along the diagonal we can see that there might be trends within the continuous distributions of the explanatory variables that align with certain ‘yes’ or ‘no’ outcomes; we can explore using these trends to bin variables in the feature engineering phase.

We see that many of the variables are skewed, so we will consider normalizing the scales of the numeric variables at a later stage.
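One way to quantify that skewness before deciding on a transformation (a sketch; it assumes the e1071 package, which is not loaded elsewhere in this report):

```r
# Sample skewness per numeric column; values far from 0 indicate heavy skew
sapply(numerical, e1071::skewness)
```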

3.3 Categorical Variable Distribution


In this subsection, we explore the categorical variable distributions to inform later feature engineering decisions.

3.31 Job Distribution

sort(summary(train$job), dec=T)/nrow(train)
  blue-collar    management    technician        admin.      services
  0.216987392   0.207697412   0.167551427   0.115599425   0.090853793
      retired self-employed  entrepreneur    unemployed     housemaid
  0.050486618   0.035003318   0.032736120   0.028257023   0.027897589
      student       unknown
  0.020874806   0.006055076 
p1<-ggplot(train, aes(x=job))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
  theme(axis.text.x = element_text(angle=90))+ labs(title = "Distribution by Kind of Job")+
  xlab("Job")+ ylab("Count")
train[, job:=factor(job, levels=names(sort(summary(train$job), dec=T)))]
levels(train$job)
 [1] "blue-collar"   "management"    "technician"    "admin."
 [5] "services"      "retired"       "self-employed" "entrepreneur"
 [9] "unemployed"    "housemaid"     "student"       "unknown"      
p1

We see that a large portion of potential targets is made up of blue-collar and management-related jobs.

3.32 Marital Status Distribution

sort(summary(train$marital), dec=T)/nrow(train)
  married    single  divorced
0.6037934 0.2815196 0.1146870 
p2<-ggplot(train, aes(x=marital))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
  theme(axis.text.x = element_text(angle=0))+ labs(title = "Distribution by Marital Status")+
  xlab("Marital Status")+ ylab("Count")
train[, marital:=factor(marital, levels=names(sort(summary(train$marital), dec=T)))]
levels(train$marital)
[1] "married"  "single"   "divorced"
p2

Most of the historic potential targets were married, with a fraction being divorced.

3.33 Education Level Distribution

sort(summary(train$education), dec=T)/nrow(train)
 secondary   tertiary    primary    unknown
0.51307786 0.29302146 0.15253816 0.04136253 
p3<-ggplot(train, aes(x=education))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
  theme(axis.text.x = element_text(angle=0))+labs(title = "Distribution by Education Level")+
  xlab("Education Level")+ ylab("Count")
train[, education:=factor(education, levels=names(sort(summary(train$education), dec=T)))]
levels(train$education)
[1] "secondary" "tertiary"  "primary"   "unknown"  
p3

Most historic potential targets have at least secondary education, making them good candidates from a business point of view.

3.34 Month Distribution

sort(summary(train$month), dec=T)/nrow(train)
       may        jul        aug        jun        nov        apr
0.30590577 0.15347821 0.13672307 0.11866844 0.08778478 0.06455983
       feb        jan        oct        sep        mar        dec
0.05786883 0.03121544 0.01645101 0.01205486 0.01064477 0.00464499 
train$month <- factor(train$month, levels = c( "jan", "feb", "mar", "apr","may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"))
p4<-ggplot(train, aes(x=month))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
  theme(axis.text.x = element_text(angle=90))+labs(title = "Distribution by Month")+
  xlab("Month")+ ylab("Count")
levels(train$month)
 [1] "jan" "feb" "mar" "apr" "may" "jun" "jul" "aug" "sep" "oct" "nov"
[12] "dec"
p4

We see contact volume spike in May, sustained throughout the summer and dropping during the rest of the year.

3.35 Loan Acquisition Distribution

sort(summary(train$loan), dec=T)/nrow(train)
       no       yes
0.8392502 0.1607498 
p5<-ggplot(train, aes(x=loan))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
  theme(axis.text.x = element_text(angle=0))+labs(title = "Loans Acquired")+
  xlab("Loans")+ ylab("Count")
p5

We see that the majority of potential targets do not have a loan.

3.36 Housing Loan Acquisition Distribution

sort(summary(train$housing), dec=T)/nrow(train)
      yes        no
0.5566799 0.4433201 
p6<-ggplot(train, aes(x=housing))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
  theme(axis.text.x = element_text(angle=0))+labs(title = "Housing Loan Acquired")+
  xlab("Housing Loan")+ ylab("Count")
p6

Housing loans, however, are more evenly split.

3.37 Contact Type Distribution

sort(summary(train$contact), dec=T)/nrow(train)
  cellular    unknown  telephone
0.64720195 0.28826587 0.06453218 
p7<-ggplot(train, aes(x=contact))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
  theme(axis.text.x = element_text(angle=0))+labs(title = "Contact Type")+
  xlab("Contact")+ ylab("Count")
p7

Most potential applicants can be reached by cellular phone; for a significant portion, however, the contact type is unknown.

3.38 Previous Campaign Outcome Distribution

sort(summary(train$poutcome), dec=T)/nrow(train)
  unknown   failure     other   success
0.8179606 0.1089914 0.0399248 0.0331232 
p8<-ggplot(train, aes(x=poutcome))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
  theme(axis.text.x = element_text(angle=0))+labs(title = "Previous Campaign Outcome")+
  xlab("Outcome")+ ylab("Count")
p8

In this graph most values are unknown; for this reason the variable may have low explanatory power for the model.

3.4 Bivariate Analysis


In the bivariate analysis, we can see what proportion of clients have subscribed and not subscribed according to each variable. We further explore trends amongst the explanatory variables, to inform further feature creation.

3.41 Marital

p9<-ggplot(train,
           aes(x = marital,
               fill = y)) +
  geom_bar(position = "dodge")+ scale_fill_manual(values=c("dodgerblue3", "dodgerblue4"))+
  labs(title = "Term Deposit Subscription by Marital Status")+xlab("Marital Status")+ ylab("Count")
p9

We see that the proportion of married clients that subscribe to a term deposit is smaller than that of single or divorced clients.
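This proportion can also be read off numerically (a sketch):

```r
# Subscription rate within each marital-status level (rows sum to 1)
prop.table(table(train$marital, train$y), margin = 1)
```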

3.42 Education

p10<-ggplot(train,
           aes(x = education,
               fill = y)) +
  geom_bar(position = "dodge")+ scale_fill_manual(values=c("dodgerblue3", "dodgerblue4"))+
  labs(title = "Term Deposit Subscription by Education Level")+xlab("Education Level")+ ylab("Count")
p10

Clients with tertiary education seem more inclined to subscribe to a term deposit than those with primary or secondary education.

3.43 Job + Term Deposit bar chart (count)

p11<-ggplot(train,
           aes(x = job,
               fill = y)) +
  geom_bar(position = "dodge")+ scale_fill_manual(values=c("dodgerblue3", "dodgerblue4"))+ theme(axis.text.x = element_text(angle=90))+
  labs(title = "Term Deposit Subscription by Job")+xlab("Job")+ ylab("Count")
p11

It seems that student clients subscribe to a term deposit more often than working or retired clients.

3.44 Age + Term Deposit bar chart (count)

p12<-ggplot(train,
           aes(x = age,
               fill = y)) +
  geom_bar(position = "dodge")+ scale_fill_manual(values=c("dodgerblue3", "dodgerblue4"))+ theme(axis.text.x = element_text(angle=90))+
  labs(title = "Term Deposit Subscription by Age")+xlab("Age")+ ylab("Count")
p12

It seems that middle-aged clients, from 30 to 35 years, go for a term deposit subscription more than younger and older clients.

3.5 Outliers Analysis


In the following section we analyze the distribution of the numerical explanatory variables according to their response to the term deposit subscription.

3.51 Call Duration

p13<-ggplot(train, aes(x=as.factor(y), y=duration)) +
  geom_boxplot(fill="dodgerblue3", color="dodgerblue4")+
  theme_classic()+labs(title = "Call Duration by Term Deposit Subscription")+xlab("Term deposit")+
  ylab("Call Duration (sec.)")
p13

3.52 Days from Last Contact

p14<-ggplot(train, aes(x=as.factor(y), y=pdays)) +
  geom_boxplot(fill="dodgerblue3", color="dodgerblue4")+
  theme_classic()+labs(title = "Days from last contact by Term Deposit Subscription")+xlab("Term deposit")+
  ylab("Days from last contact")
p14

3.53 Contacts before this Campaign

p15<-ggplot(train, aes(x=as.factor(y), y=previous)) +
  geom_boxplot(fill="dodgerblue3", color="dodgerblue4")+
  theme_classic()+labs(title = "Contacts before this Campaign by Term Deposit Subscription")+xlab("Term deposit")+
  ylab("Contacts before this campaign")
p15

3.54 Contacts during this Campaign

p16<-ggplot(train, aes(x=as.factor(y), y=campaign)) +
  geom_boxplot(fill="dodgerblue3", color="dodgerblue4")+
  theme_classic()+labs(title = "Contacts during this Campaign by Term Deposit Subscription")+xlab("Term deposit")+
  ylab("Contacts during this campaign")
p16

3.6 Coefficient of Variation

The coefficient of variation is a dimensionless measure of dispersion; the lower the value, the less dispersed a feature is.

numeric_variables<-names(numerical)
sd_numeric_variables<-sapply(train[,numeric_variables, with=F], sd)
cv_numeric_variables<-sd_numeric_variables/colMeans(train[,numeric_variables, with=F])

ggplot(data.table(var=names(cv_numeric_variables),cv=cv_numeric_variables),
       aes(var,fill=cv))+geom_bar()+coord_polar()+scale_fill_gradient(low='white', high = 'dodgerblue4')

We check for variables with a coefficient of variation below 0.05; there are none.

cv_numeric_variables[cv_numeric_variables < 0.05]
named numeric(0)

3.7 Linear Dependencies

caret’s findLinearCombos identifies sets of linearly dependent columns so they can be removed; here there are none.

lc<-findLinearCombos(train[, ..numeric_variables])
lc
$linearCombos
list()

$remove
NULL

4. Baseline

4.1 Setting up Preprocessing: Centering, Scaling and a Yeo-Johnson Transformation

To prevent certain features from dominating purely because of their scale, we center and scale them. We also reduce skewness with a Yeo-Johnson transformation, since, unlike Box-Cox, it can handle negative values. This combination is applied during each model training.

pp<-c("center", "scale", "YeoJohnson")
preProcess(train, method =pp)
Created from 36168 samples and 17 variables

Pre-processing:
  - centered (7)
  - ignored (10)
  - scaled (7)
  - Yeo-Johnson transformation (6)

Lambda estimates for Yeo-Johnson transformation:
-0.19, 0.84, 0.69, 0.13, -1.08, -0.6
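Note that preProcess only estimates the transformation; applying it is done via predict (a sketch, shown for illustration, since in this report the preprocessing is instead passed to train via preProcess=pp):

```r
# Estimate the transformation on the training data, then apply it
pp_obj <- preProcess(train, method = pp)
train_pp <- predict(pp_obj, newdata = train)
```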

4.2 Dummy Encoding

We dummy encode the factor variables to be able to run different models on the data.

dummy<-dummyVars(formula= ~., data = train[, -"y"], sep = "_")
final_train<-data.table(predict(dummy, newdata = train[, -"y"]))
final_train$y<-train$y
final_test<-data.table(predict(dummy, newdata = test))

4.3 Train Function

Now we define our training function. It performs 5-fold cross-validation on 80% of the training data, followed by a final validation on the remaining 20%, which acts as a holdout. It also applies the aforementioned preprocessing.

train_val<- function(train_dt, model, sampling){
  tc<-trainControl(
    method = "cv",
    number=5,
    savePredictions = TRUE,
    classProbs=TRUE,
    summaryFunction = prSummary)

  trainIndex <- createDataPartition(train_dt$y, p = .8, list = FALSE)
  model_train <- train_dt[ trainIndex,]
  holdout  <- train_dt[-trainIndex,]

  if(!missing(sampling)){
    if(sampling == 'over'){
      model_train<-upSample(x = model_train[, -"y"],y = model_train$y, yname="y")
    }
    else if(sampling == 'under'){
      model_train<-downSample(x = model_train[, -"y"],y = model_train$y, yname="y")
    }
    else {
      model_train<-SMOTE(y ~ ., data  = model_train)
    }
  }

  ini<-now()
  model<- train(y~ ., data = model_train, method = model, metric="Recall", trControl=tc, preProcess=pp)
  message("Cross Validation Scores having Yes as the positive class")
  message(model$results)
  message("Train + Predict time:")
  message(now()-ini)

  predicted = predict(model, newdata = holdout)

  print("Holdout Scores")
  print(confusionMatrix(table(predicted, holdout$y), positive="yes", mode="everything"))

  return(model)
}

4.4 Logistic Regression

Now we run a simple logistic regression as a baseline; this will serve as a benchmark for judging whether our feature engineering steps improve performance.

lm <- train_val(final_train, "glm")
Cross Validation Scores having Yes as the positive class
10.553077543742420.6436530298069690.34684284294940.4507432354300750.0149883538944520.01689834950935990.0143103103567690.0159508589731401
Train + Predict time:
1.29214898347855
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  283  140
      no   556 6254

               Accuracy : 0.9038
                 95% CI : (0.8967, 0.9105)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 4.007e-08

                  Kappa : 0.402

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.33731
            Specificity : 0.97810
         Pos Pred Value : 0.66903
         Neg Pred Value : 0.91836
              Precision : 0.66903
                 Recall : 0.33731
                     F1 : 0.44849
             Prevalence : 0.11600
         Detection Rate : 0.03913
   Detection Prevalence : 0.05848
      Balanced Accuracy : 0.65771

       'Positive' Class : yes
                                          

So our Sensitivity (Recall) on the “yes” class is very low, meaning we correctly classify only a small portion of the “yes” cases.

5. Feature Engineering

Here we merge the two datasets to perform multiple feature engineering steps.

y <- final_train$y
final_train$y <- NULL
merged_df <- rbind(final_train, final_test)

5.1 Days to end of month assuming all the months have 30 days

We create a variable that measures days to the end of the month, to see whether the contact date has any explanatory value for our prediction.

merged_df$days_to_end_of_month <- 30 - merged_df$day

5.2 Balance General Status -> 1 if positive, 0 if negative or 0

We booleanize the ‘balance’ variable to make the distinction between having a positive balance and not more explicit.

positive_balance <- function(number){
  is_positive = 0
  if(number>0){is_positive <- 1}
  return(is_positive)
}
merged_df$balance_general_status <- as.numeric(lapply(merged_df$balance,  FUN = positive_balance))
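The same indicator can be computed in one vectorized step, which is the more idiomatic (and faster) R form:

```r
# Equivalent to applying positive_balance to every element of balance
merged_df$balance_general_status <- as.numeric(merged_df$balance > 0)
```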

5.3 Quantity of loans 0, 1 or 2 (housing and personal)

We create a quantity of loans category, since having a loan or more can make a difference in the decision of a potential client.

merged_df$quantity_loans <- merged_df$housing_yes+merged_df$loan_yes

5.4 Days_binned_weeks

We bin days into weeks in case there is a correlation between the target and the week in which the client was contacted.

week_type <- function(day){
  week_num=0
  if(day < 7){
    week_num <-1
  }else if(day< 15){
    week_num <-2
  }else if(day<22){
    week_num <-3
  }else if(day<30){
    week_num<-4
  }else{week_num <-5}
  return(week_num)
}
merged_df$week <- as.factor(as.numeric(lapply(merged_df$day, FUN = week_type)))
dmy<-dummyVars("~.",data = merged_df)
merged_df<-data.table(predict(dmy, newdata = merged_df))
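As an aside, the same binning can be expressed with cut(), whose right-open breaks reproduce week_type() exactly (a sketch):

```r
# [-Inf,7) -> 1, [7,15) -> 2, [15,22) -> 3, [22,30) -> 4, [30,Inf) -> 5
week_alt <- cut(merged_df$day, breaks = c(-Inf, 7, 15, 22, 30, Inf),
                labels = 1:5, right = FALSE)
```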

5.5 Season Quarters

In the EDA we have seen that trends differ across seasons. We therefore bin months into quarters.

merged_df$Q1 <- merged_df$month_jan+merged_df$month_feb+merged_df$month_mar
merged_df$Q2 <- merged_df$month_apr+merged_df$month_may+merged_df$month_jun
merged_df$Q3 <- merged_df$month_jul+merged_df$month_aug+merged_df$month_sep
merged_df$Q4 <- merged_df$month_oct+merged_df$month_nov+merged_df$month_dec

train_featured <- merged_df[1:nrow(train),]
train_featured$y <- y
test_featured <- merged_df[(nrow(train)+1):nrow(merged_df),]

6. Modeling

Here we define a number of models to see which one performs best.

6.1 Logistic Regression


LR Over

lm_over <- train_val(train_featured, "glm","over")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  691  952
      no   148 5442

               Accuracy : 0.8479
                 95% CI : (0.8394, 0.8561)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 1

                  Kappa : 0.4764

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.82360
            Specificity : 0.85111
         Pos Pred Value : 0.42057
         Neg Pred Value : 0.97352
              Precision : 0.42057
                 Recall : 0.82360
                     F1 : 0.55681
             Prevalence : 0.11600
         Detection Rate : 0.09553
   Detection Prevalence : 0.22715
      Balanced Accuracy : 0.83735

       'Positive' Class : yes
                                          

LR Under

lm_under <- train_val(train_featured, "glm","under")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  695  959
      no   144 5435

               Accuracy : 0.8475
                 95% CI : (0.839, 0.8557)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 1

                  Kappa : 0.4771

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.82837
            Specificity : 0.85002
         Pos Pred Value : 0.42019
         Neg Pred Value : 0.97419
              Precision : 0.42019
                 Recall : 0.82837
                     F1 : 0.55756
             Prevalence : 0.11600
         Detection Rate : 0.09609
   Detection Prevalence : 0.22867
      Balanced Accuracy : 0.83919

       'Positive' Class : yes
                                         

LR SMOTE

lm_S <- train_val(train_featured, "glm","SMOTE")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  621  785
      no   218 5609

               Accuracy : 0.8613
                 95% CI : (0.8532, 0.8692)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 1

                  Kappa : 0.4773

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.74017
            Specificity : 0.87723
         Pos Pred Value : 0.44168
         Neg Pred Value : 0.96259
              Precision : 0.44168
                 Recall : 0.74017
                     F1 : 0.55323
             Prevalence : 0.11600
         Detection Rate : 0.08586
   Detection Prevalence : 0.19439
      Balanced Accuracy : 0.80870

       'Positive' Class : yes
                                          

6.2 Random Forest


RF Over

rf_over <- train_val(train_featured, "ranger","over")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  454  366
      no   385 6028

               Accuracy : 0.8962
                 95% CI : (0.8889, 0.9031)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 0.0005591

                  Kappa : 0.4887

 Mcnemar's Test P-Value : 0.5112907

            Sensitivity : 0.54112
            Specificity : 0.94276
         Pos Pred Value : 0.55366
         Neg Pred Value : 0.93997
              Precision : 0.55366
                 Recall : 0.54112
                     F1 : 0.54732
             Prevalence : 0.11600
         Detection Rate : 0.06277
   Detection Prevalence : 0.11337
      Balanced Accuracy : 0.74194

       'Positive' Class : yes
                                          

RF Under

rf_under <- train_val(train_featured, "ranger","under")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  755 1020
      no    84 5374

               Accuracy : 0.8474
                 95% CI : (0.8389, 0.8556)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 1

                  Kappa : 0.4987

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.8999
            Specificity : 0.8405
         Pos Pred Value : 0.4254
         Neg Pred Value : 0.9846
              Precision : 0.4254
                 Recall : 0.8999
                     F1 : 0.5777
             Prevalence : 0.1160
         Detection Rate : 0.1044
   Detection Prevalence : 0.2454
      Balanced Accuracy : 0.8702

       'Positive' Class : yes
                                          

RF SMOTE

rf_S <- train_val(train_featured, "ranger","SMOTE")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  646  710
      no   193 5684

               Accuracy : 0.8752
                 95% CI : (0.8673, 0.8827)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 0.9905

                  Kappa : 0.5198

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.76996
            Specificity : 0.88896
         Pos Pred Value : 0.47640
         Neg Pred Value : 0.96716
              Precision : 0.47640
                 Recall : 0.76996
                     F1 : 0.58861
             Prevalence : 0.11600
         Detection Rate : 0.08931
   Detection Prevalence : 0.18747
      Balanced Accuracy : 0.82946

       'Positive' Class : yes
                                          

6.3 XGBoosting Tree


XGB Over

xgb_over <- train_val(train_featured, "xgbTree","over")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  735  891
      no   104 5503

               Accuracy : 0.8624
                 95% CI : (0.8543, 0.8703)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 1

                  Kappa : 0.5234

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.8760
            Specificity : 0.8607
         Pos Pred Value : 0.4520
         Neg Pred Value : 0.9815
              Precision : 0.4520
                 Recall : 0.8760
                     F1 : 0.5963
             Prevalence : 0.1160
         Detection Rate : 0.1016
   Detection Prevalence : 0.2248
      Balanced Accuracy : 0.8683

       'Positive' Class : yes
                                          

XGB Under

xgb_under <- train_val(train_featured, "xgbTree","under")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  739 1023
      no   100 5371

               Accuracy : 0.8447
                 95% CI : (0.8362, 0.853)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 1

                  Kappa : 0.4877

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.8808
            Specificity : 0.8400
         Pos Pred Value : 0.4194
         Neg Pred Value : 0.9817
              Precision : 0.4194
                 Recall : 0.8808
                     F1 : 0.5682
             Prevalence : 0.1160
         Detection Rate : 0.1022
   Detection Prevalence : 0.2436
      Balanced Accuracy : 0.8604

       'Positive' Class : yes
                                         

XGB SMOTE

xgb_S <- train_val(train_featured, "xgbTree","SMOTE")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  615  477
      no   224 5917

               Accuracy : 0.9031
                 95% CI : (0.896, 0.9098)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 1.15e-07

                  Kappa : 0.5822

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.73302
            Specificity : 0.92540
         Pos Pred Value : 0.56319
         Neg Pred Value : 0.96352
              Precision : 0.56319
                 Recall : 0.73302
                     F1 : 0.63698
             Prevalence : 0.11600
         Detection Rate : 0.08503
   Detection Prevalence : 0.15097
      Balanced Accuracy : 0.82921

       'Positive' Class : yes
                                         

6.4 Logistic Regression with Lasso Regularization


GLR Over

glmnet_over <- train_val(train_featured, "glmnet","over")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  693  946
      no   146 5448

               Accuracy : 0.849
                 95% CI : (0.8406, 0.8572)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 1

                  Kappa : 0.4794

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.82598
            Specificity : 0.85205
         Pos Pred Value : 0.42282
         Neg Pred Value : 0.97390
              Precision : 0.42282
                 Recall : 0.82598
                     F1 : 0.55932
             Prevalence : 0.11600
         Detection Rate : 0.09581
   Detection Prevalence : 0.22660
      Balanced Accuracy : 0.83902

       'Positive' Class : yes
                                          

GLR Under

glmnet_under <- train_val(train_featured, "glmnet","under")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  682  885
      no   157 5509

               Accuracy : 0.8559
                 95% CI : (0.8476, 0.864)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 1

                  Kappa : 0.4898

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.81287
            Specificity : 0.86159
         Pos Pred Value : 0.43523
         Neg Pred Value : 0.97229
              Precision : 0.43523
                 Recall : 0.81287
                     F1 : 0.56692
             Prevalence : 0.11600
         Detection Rate : 0.09429
   Detection Prevalence : 0.21665
      Balanced Accuracy : 0.83723

       'Positive' Class : yes
                                         

GLR SMOTE

glmnet_S <- train_val(train_featured, "glmnet","SMOTE")
[1] "Holdout Scores"
Confusion Matrix and Statistics


predicted  yes   no
      yes  654  736
      no   185 5658

               Accuracy : 0.8727
                 95% CI : (0.8648, 0.8803)
    No Information Rate : 0.884
    P-Value [Acc > NIR] : 0.9986

                  Kappa : 0.5169

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.77950
            Specificity : 0.88489
         Pos Pred Value : 0.47050
         Neg Pred Value : 0.96834
              Precision : 0.47050
                 Recall : 0.77950
                     F1 : 0.58681
             Prevalence : 0.11600
         Detection Rate : 0.09042
   Detection Prevalence : 0.19217
      Balanced Accuracy : 0.83220

       'Positive' Class : yes
                                          

6.5 Results

We print the results to compare the different models.

results <- resamples(list(
    LMO = lm_over,
    LMU = lm_under,
    LMS = lm_S,
    RFO = rf_over,
    RFU = rf_under,
    RFS = rf_S,
    XGBO = xgb_over,
    XGBU = xgb_under,
    XGBS = xgb_S,
    GLO = glmnet_over,
    GLU = glmnet_under,
    GLS = glmnet_S)
)

summary(results)

Call:
summary.resamples(object = results)

Models: LMO, LMU, LMS, RFO, RFU, RFS, XGBO, XGBU, XGBS, GLO, GLU, GLS
Number of resamples: 5

AUC
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
LMO  0.8878020 0.8903696 0.8909167 0.8909165 0.8920046 0.8934896    0
LMU  0.8855870 0.8885111 0.8945513 0.8946716 0.9015525 0.9031563    0
LMS  0.8623896 0.8625047 0.8787070 0.8737488 0.8810630 0.8840797    0
RFO  0.4978557 0.5001908 0.5004142 0.5050155 0.5073429 0.5192739    0
RFU  0.8829104 0.8888379 0.8899093 0.8938761 0.8939620 0.9137610    0
RFS  0.9613913 0.9624824 0.9626484 0.9637817 0.9635713 0.9688150    0
XGBO 0.9306198 0.9309255 0.9321529 0.9325274 0.9324328 0.9365060    0
XGBU 0.8891333 0.8908632 0.8981468 0.8963347 0.9008828 0.9026474    0
XGBS 0.9684089 0.9687097 0.9692926 0.9696295 0.9706853 0.9710509    0
GLO  0.8915520 0.8945663 0.8954650 0.8950614 0.8959636 0.8977602    0
GLU  0.8756316 0.8848797 0.8882006 0.8894145 0.8980499 0.9003107    0
GLS  0.8574968 0.8665558 0.8673677 0.8689898 0.8755728 0.8779559    0

F
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
LMO  0.8332179 0.8350313 0.8368093 0.8363630 0.8380087 0.8387479    0
LMU  0.8111782 0.8230828 0.8240602 0.8310036 0.8438438 0.8528529    0
LMS  0.8155194 0.8162025 0.8166794 0.8215625 0.8274994 0.8319117    0
RFO  0.9659992 0.9664493 0.9664620 0.9675197 0.9687441 0.9699441    0
RFU  0.8605974 0.8622927 0.8625455 0.8663620 0.8701299 0.8762447    0
RFS  0.9269240 0.9276760 0.9288317 0.9313133 0.9356952 0.9374394    0
XGBO 0.8969868 0.8983051 0.9020127 0.9010425 0.9022016 0.9057065    0
XGBU 0.8554996 0.8563135 0.8659341 0.8629170 0.8673993 0.8694384    0
XGBS 0.8960483 0.8968096 0.9016270 0.9005217 0.9026680 0.9054554    0
GLO  0.8364216 0.8382878 0.8387609 0.8389365 0.8402494 0.8409629    0
GLU  0.8199697 0.8238806 0.8315630 0.8297333 0.8337096 0.8395437    0
GLS  0.8147405 0.8157236 0.8169869 0.8175628 0.8184802 0.8218830    0

Precision
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
LMO  0.8369992 0.8439944 0.8464307 0.8443671 0.8470447 0.8473663    0
LMU  0.8223583 0.8328267 0.8390093 0.8407451 0.8502269 0.8593041    0
LMS  0.8223120 0.8322148 0.8368942 0.8342632 0.8395503 0.8403447    0
RFO  0.9345760 0.9352715 0.9354190 0.9373620 0.9397280 0.9418155    0
RFU  0.8242507 0.8351955 0.8380952 0.8366455 0.8423295 0.8433566    0
RFS  0.9015933 0.9068884 0.9087452 0.9095903 0.9150451 0.9156798    0
XGBO 0.8653706 0.8664850 0.8724869 0.8701277 0.8730565 0.8732394    0
XGBU 0.8263889 0.8345120 0.8515850 0.8436317 0.8526466 0.8530259    0
XGBS 0.8982301 0.8999019 0.9004004 0.9031675 0.9086269 0.9086781    0
GLO  0.8374757 0.8490040 0.8499096 0.8483190 0.8523002 0.8529053    0
GLU  0.8263473 0.8325653 0.8419453 0.8409974 0.8469861 0.8571429    0
GLS  0.8223120 0.8227848 0.8321318 0.8304507 0.8325629 0.8424622    0

Recall
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
LMO  0.8227131 0.8274042 0.8291634 0.8285313 0.8303030 0.8330727    0
LMU  0.8002981 0.8077496 0.8154762 0.8215155 0.8375559 0.8464978    0
LMS  0.7974181 0.8007948 0.8088381 0.8092974 0.8157895 0.8236463    0
RFO  0.9996090 0.9996091 0.9996091 0.9996872 0.9998045 0.9998045    0
RFU  0.8837556 0.8912072 0.8986587 0.8983904 0.9002976 0.9180328    0
RFS  0.9478649 0.9498261 0.9553128 0.9541122 0.9572989 0.9602583    0
XGBO 0.9310008 0.9325513 0.9329554 0.9342429 0.9331509 0.9415559    0
XGBU 0.8792846 0.8807750 0.8822653 0.8831932 0.8867362 0.8869048    0
XGBS 0.8838133 0.8932473 0.8946846 0.8979946 0.9071500 0.9110780    0
GLO  0.8205629 0.8256450 0.8269795 0.8298603 0.8330727 0.8430414    0
GLU  0.8077496 0.8166915 0.8214286 0.8188312 0.8226528 0.8256334    0
GLS  0.8022851 0.8023833 0.8048659 0.8051253 0.8068520 0.8092399    0

6.6 Plot of Results

We plot the results to get a deeper understanding of which model performs best.

bwplot(results, layout = c(2, 1), scales = list(relation="free"))

Based on this, we can see that the best-performing model is the Random Forest with SMOTE sampling.
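The boxplot conclusion can also be made concrete numerically. A minimal base-R sketch, using the mean cross-validated Recall figures copied from the summary table above (Recall is the metric we optimize in the tuning step that follows):

```r
# Mean cross-validated Recall per model, copied from the Recall table above
mean_recall <- c(
  LMO = 0.8285, LMU = 0.8215, LMS = 0.8093,
  RFO = 0.9997, RFU = 0.8984, RFS = 0.9541,
  XGBO = 0.9342, XGBU = 0.8832, XGBS = 0.8980,
  GLO = 0.8299, GLU = 0.8188, GLS = 0.8051
)

# RFO reports a near-perfect Recall but an AUC of ~0.5 (see the AUC table),
# so we set it aside before ranking the remaining models by Recall
candidates <- mean_recall[names(mean_recall) != "RFO"]
names(which.max(candidates))
# "RFS"
```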

7. Hyperparameter Tuning

Finally, we perform a cross-validated grid search on the chosen Random Forest in order to define the final model, optimizing for Recall.

# Exhaustive tuning grid for ranger
grid <- expand.grid(
  mtry = c(16, 32, 63),
  splitrule = c('extratrees', 'gini'),
  min.node.size = c(1, 3, 5)
)

tc <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE,
  summaryFunction = prSummary,  # reports AUC, Precision, Recall, F
  classProbs = TRUE,
  search = "random"             # ignored here, since an explicit tuneGrid is supplied
)

ini <- now()

# 95/5 stratified split: the 5% holdout gives the final sensitivity estimate
trainIndex <- createDataPartition(train_featured$y, p = .95, list = FALSE)
model_train <- train_featured[ trainIndex,]
holdout  <- train_featured[-trainIndex,]

# Balance the training portion only, never the holdout
model_train <- SMOTE(y ~ ., data = model_train)

grid_rf <- train(
  y ~ .,
  data = model_train,
  method = "ranger",
  num.trees = 500,
  tuneGrid = grid,
  trControl = tc,
  metric = "Recall",
  tuneLength = 5,               # ignored when tuneGrid is given
  verbose = TRUE,
  importance = "impurity",
  preProcess = pp               # pre-processing recipe defined earlier
)

print(now()-ini)
Time difference of 47.0147 mins
grid_rf
Random Forest

27902 samples
   63 predictor
    2 classes: 'yes', 'no'

Pre-processing: centered (63), scaled (63), Yeo-Johnson transformation (63)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 22323, 22321, 22322, 22321, 22321
Resampling results across tuning parameters:

  mtry  splitrule   min.node.size  AUC        Precision  Recall     F
  16    extratrees  1              0.9739380  0.9133025  0.9504094  0.9314820
  16    extratrees  3              0.9737242  0.9127051  0.9493226  0.9306476
  16    extratrees  5              0.9714476  0.9081152  0.9462284  0.9267747
  16    gini        1              0.9705882  0.9087755  0.9453925  0.9267183
  16    gini        3              0.9709983  0.9082192  0.9448070  0.9261436
  16    gini        5              0.9709856  0.9080662  0.9414623  0.9244547
  32    extratrees  1              0.9663423  0.9062043  0.9514970  0.9282885
  32    extratrees  3              0.9693028  0.9064545  0.9503262  0.9278635
  32    extratrees  5              0.9682875  0.9027077  0.9487373  0.9251417
  32    gini        1              0.9612021  0.9033966  0.9422148  0.9223890
  32    gini        3              0.9630954  0.9037762  0.9415456  0.9222649
  32    gini        5              0.9646612  0.9026009  0.9398731  0.9208522
  63    extratrees  1              0.9601284  0.9022918  0.9487370  0.9249180
  63    extratrees  3              0.9622551  0.9029146  0.9486536  0.9252099
  63    extratrees  5              0.9632128  0.8999184  0.9457268  0.9222444
  63    gini        1              0.9331448  0.9010824  0.9312595  0.9159052
  63    gini        3              0.9356459  0.9008694  0.9298382  0.9151051
  63    gini        5              0.9396563  0.9006608  0.9298383  0.9149929

Recall was used to select the optimal model using the largest value.
The final values used for the model were mtry = 32, splitrule =
 extratrees and min.node.size = 1.

7.1 Final model

And this is the final model:

grid_rf$finalModel
Ranger result

Call:
 ranger::ranger(dependent.variable.name = ".outcome", data = x,      mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size,      splitrule = as.character(param$splitrule), write.forest = TRUE,      probability = classProbs, ...)

Type:                             Probability estimation
Number of trees:                  500
Sample size:                      27902
Number of independent variables:  63
Mtry:                             32
Target node size:                 1
Variable importance mode:         impurity
Splitrule:                        extratrees
OOB prediction error (Brier s.):  0.04479533 
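The OOB Brier score reported above is the mean squared difference between the predicted probability of the positive class and the actual outcome. A minimal base-R sketch with toy values (illustrative only, not the model's actual OOB predictions):

```r
# Brier score: mean((predicted P(yes) - actual outcome)^2); lower is better
p_yes  <- c(0.90, 0.20, 0.75, 0.05)   # toy predicted probabilities of "yes"
actual <- c(1, 0, 1, 0)               # toy labels: 1 = "yes", 0 = "no"

brier <- mean((p_yes - actual)^2)
brier
# 0.02875
```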

7.2 Variable Importance

Here we see the variable importance for the final model.

plot(varImp(grid_rf), top = 20)
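caret's varImp() wraps the impurity importances that ranger stores as a named numeric vector (grid_rf$finalModel$variable.importance), so the same top-n ranking can be recovered by sorting directly. A sketch with a toy vector standing in for the real importances:

```r
# Toy stand-in for grid_rf$finalModel$variable.importance (values are illustrative)
imp <- c(duration = 950, balance = 310, age = 280, pdays = 150, housing = 90)

# Sort descending and keep the top n (here 3; the plot above uses top = 20)
top_n <- sort(imp, decreasing = TRUE)[1:3]
names(top_n)
# "duration" "balance" "age"
```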

8. Predictions

We predict on the test set, add the predictions to it and create our predictions file.

predicted <- predict(grid_rf, newdata = test_featured)
df_pred<-cbind(test, predicted)
fwrite(df_pred[,c("predicted")],"predictions.csv")
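As a sanity check, it can be worth reading the file back and inspecting the class distribution. A minimal base-R sketch using a temporary file (the column name predicted matches the code above; the toy values are illustrative):

```r
# Write a small predictions file and read it back to verify the format
predicted <- factor(c("yes", "no", "no", "yes", "no"), levels = c("yes", "no"))
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(predicted), tmp, row.names = FALSE)

check <- read.csv(tmp, stringsAsFactors = TRUE)
table(check$predicted)   # class counts of the stored predictions
```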

9. Conclusions

We conclude that, after extensive feature exploration, feature engineering and modelling, the best-performing model is a Random Forest with SMOTE sampling and the parameters shown above. The final sensitivity score on the 5% holdout is 0.81.

The following are the main considerations and conclusions that we can extract from each part of the process:

  • Exploratory Data Analysis

The initial dataset was highly unbalanced: for our target variable (term deposit subscription), 88% of the labels correspond to class NO. Considering this, we tested different sampling techniques (oversampling, undersampling and SMOTE) to understand which technique performs better.
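The imbalance figure quoted above is a one-liner to check; a sketch on a toy target vector with the same rough proportions (on the real data this would be prop.table(table(train$y))):

```r
# Toy target vector mirroring the dataset's ~88% "no" / ~12% "yes" split
y <- factor(c(rep("no", 88), rep("yes", 12)), levels = c("yes", "no"))

# Relative class frequencies; the "no" class dominates
round(prop.table(table(y)), 2)
# yes 0.12, no 0.88
```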

A Shiny Dashboard was built to present some plots and basic statistics, making the given data easier to understand. R provides useful tools of this kind for exploring all kinds of data in a more visual and interactive way.

  • Feature Engineering

We created a set of additional features based on the initial dataset. Among these new features, Days to end of month, Quantity of loans, Q3 and Balance General Status, among others, turned out to be relevant for the final model.

Considering that the Random Forest algorithm automatically performs feature selection, we can see from the variable importance plot above that the most influential variable is duration.

Our final feature set remained unchanged because there were no highly significant correlations, no near-zero-variance features, and no linear dependencies among the features.

  • Modelling

We assume that a false negative (missing a client who would subscribe) carries a higher cost than a false positive, so we base our model selection on the sensitivity metric (true positive rate).
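Sensitivity is TP / (TP + FN). Using the RF SMOTE confusion matrix reported in section 6.2 above as a worked example:

```r
# RF SMOTE confusion matrix from the report (rows = predicted, cols = actual)
cm <- matrix(c(646,  710,
               193, 5684),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("yes", "no"),
                             actual    = c("yes", "no")))

tp <- cm["yes", "yes"]   # true positives: predicted yes, actually yes
fn <- cm["no",  "yes"]   # false negatives: predicted no, actually yes

sensitivity <- tp / (tp + fn)
round(sensitivity, 5)
# 0.76996 -- matches the Sensitivity reported for RF SMOTE
```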

After running a Logistic Regression as a baseline model and including the new features, we tested four different classification techniques:

  • Logistic Regression
  • Logistic Regression with Regularization (LASSO)
  • Random Forest
  • XG Boost

The best Recall score was obtained by the Random Forest model, so a cross-validated grid search was applied in order to define the final model and optimize Recall. Different tuning parameter combinations were tested, with the best outcome given by the following:

  • ‘mtry’ = 32
  • ‘splitrule’ = extratrees
  • ‘min.node.size’ = 1

The results of the final model were presented with an ROC curve and a plotted confusion matrix.

Finally, the target variable was predicted on the test set and a .csv file was created to store the predictions.